Abstract


This paper develops automatic song translation (AST) for tonal languages and addresses the unique challenge of aligning words' tones with the melody of a song in addition to conveying the original meaning. We propose three criteria for effective AST (preserving meaning, singability, and intelligibility) and design metrics for these criteria. We develop a new benchmark for English-Mandarin song translation and an unsupervised AST system, Guided AliGnment for Automatic Song Translation (GagaST), which combines pre-training with three decoding constraints. Both automatic and human evaluations show that GagaST successfully balances semantics and singability.


1 Introduction 


Suppose you are asked to translate the lyrics “Let it go” from the Disney musical Frozen into Mandarin Chinese. Some good, literal translations would be A) “fàng shǒu”, B) “fàng shǒu ba” or C) “ràng tā qù ba” (Figure 1); these get the meaning across and are the domain of traditional machine translation. However, what if you needed to sing this song in Mandarin? These literal translations simply do not work: Translations A and C do not match the number of notes and break the original rhythm, while the tones of Translation B do not match the pitch flow of the original melody.


Song translation, unlike translating lyrics for understanding (subtitling), aims to translate the lyrics so that they can be sung with the original melody. Therefore, the translated lyrics must match the prosody of the pre-existing music in addition to retaining the original meaning. In Singable Translations of Songs, Low (2003) says that this is an uncommon and unusually complex task: a translator must consider rhythm, notes' pitches, phrasing, and stress. Nonetheless, there are cultural and commercial incentives for more efficient song translation; Frozen alone made over half a billion dollars in non-English box office receipts,¹ and the musical Les Misérables has been performed in over a dozen languages on stage.


[Figure 1: the melody of “Let it go. Let it go.” with four Mandarin renderings beneath it. Arrows mark the transition direction (up or down) of successive notes and tones by pitch level.]

Google Translate:           fàng shǒu ba
Baidu Translate:            ràng tā qù ba
Human lyrics translation:   fàng shǒu
Human song translation:     suí tā ba


Figure 1: Example Mandarin translations for “Let it go” in Frozen. Of these, only the official human song translation is something a singer could actually sing: it fits the length of the notes and matches the tones with the pitch of notes. GagaST finds translations that satisfy these constraints.


As we discuss in Section 2, while translating West- 
ern songs resembles poetry translation, translat- 
ing into tonal languages (e.g., Mandarin, Zulu and 
Vietnamese) introduces new problems. In tonal lan- 
guages, a word’s pitch contributes to its meaning 
(Figure 2); when singing in tonal languages, the 


¹ https://www.the-numbers.com/movie/Frozen-(2013)#tab=international
[Figure 2: the left panel lists the four Mandarin tones on the syllable “shu”; the right panel plots each tone's pitch contour against Normalized Time (%), from 0 to 100.]

Tone 1: 叔 shū (uncle)
Tone 2: 熟 shú (cooked, familiar)
Tone 3: 鼠 shǔ (mouse, Muroidea)
Tone 4: 树 shù (tree)


Figure 2: In tonal languages like Mandarin, the pitch changes the meaning of the words (left). Each of the four tones in Mandarin (right) has a different pitch profile. Figure from Xu (1997).


tones of translated words must align with the “flow” of the pitches in the music (Section 2.1). For example, if “fáng shǒu” were sung instead of “fàng shǒu” (because notes are going up), a listener might hear “defensive” instead of the intended meaning.


This paper builds the first system for automatic 
song translation (AST) for one tonal language— 
Mandarin. Section 3 proposes three criteria— 
preserving semantics, singability and intelligibil- 
ity—needed in an AST system. 


Guided by those goals, we propose an unsupervised 
AST system, Guided AliGnment for Automatic 
Song Translation (GagaST). GagaST begins with an 
out-of-domain translation system (Section 4.1) and 
adds song alignment constraints that favor trans- 
lations that are the appropriate length and whose 
tones match the underlying music (Section 4.2). 
Naturally, such constraints trade off semantic meaning against singability and intelligibility. Section 5.4 discusses this trade-off between song alignment scores and the standard machine translation metric, BLEU.


These criteria also form the basis of our initial automatic evaluation (Section 5.3). However, we go beyond automatic evaluation with a human-centered evaluation by musicology students. GagaST creates singable songs that make sense given the original text, and our proposed alignment scores correlate with human judgements (Section 5.4.3).²


2 Background: Prose, Poetry, and Song 
Translation 


A spoken language can be divided into two forms: prose, which corresponds to natural conversation and conventional grammatical structure; and verse, typically rhythmic and broken into stanzas, such as poetry and song lyrics.


² Examples of translated songs by GagaST at https://gagast.github.io/posts/gagast.


[Figure 3: two settings of the same melody, each with pinyin, a word-by-word gloss, a translation, and its inter-syllable pitch alignment score.]

Original Lyrics (Inconsistent Tone): sì zài yǎn qián (appear / where / eye / front), “As if before my eyes”. Inter-syllable pitch alignment score: 0.5
Misheard Lyrics (Consistent Tone): sǐ zài yǎn qián (death / where / eye / front), “Die before my eyes”. Inter-syllable pitch alignment score: 0.75


Figure 3: If a song's music doesn't match the tones of the lyrics, it can cause the hearer to misunderstand the lyrics. In this example, someone can hear “sǐ zài” instead of “sì zài”, because the notes are going up and “sì zài” is going down.


The vast majority of machine translation research 
has been focused on prose translation and has made 
huge progress; in contrast verse translation is more 
difficult as it must obey the rhythmic constraints 
and is less developed. In his tour de force work Le 
Ton Beau de Marot, Douglas Hofstadter presents 
eighty-nine translations of a single poem to capture 
the panoply of considerations of what makes the 
task difficult (Hofstadter, 1997). 


In Western verse, the rhythmic structure is mostly defined by meter, such as the iambic pentameter for sonnets, which defines the length of each line and the patterns of long versus short and stressed versus weak syllables. Existing work (Greene et al., 2010; Ghazvininejad et al., 2018) uses finite-state constraints to encode both meter and rhyme.


Song translation, on the other hand, can be viewed 
as a translation where the melody defines the con- 
straints. Reproducing all of the essential values of 
a song—perfectly matching the meaning, perfectly 
singable, and perfectly understandable—is an im- 
possible ideal (Franzon, 2008). Thus, tradeoffs are 
unavoidable. Low (2003) argues for prioritizing 
singability over other qualities such as sense and 
rhyme since “effectiveness on stage” is a practical 
necessity. Tonal languages (e.g., Mandarin, Zulu and Vietnamese) dramatically increase the complexity of singability and introduce a new factor that could hamper intelligibility.
2.1 Song Translation for Tonal Languages


For tonal languages, pitch contributes to the meaning of words. By a conservative estimate, fifty to sixty percent of the world's languages are tonal (Yip, 2002), covering over 1.5 billion people. For the lyrics to be intelligible, the speech tone and music tone should be correlated (Schneider, 1961). If not, the pitch contour could override the intended tone, which could produce different meanings. This is not just a theoretical consideration; Figure 3 shows how lyrics can be and have been misunderstood.³


2.2 Mandarin Tones and How to Sing Them


Schellenberg (2013) summarizes the rules of singing with tone with a focus on Chinese dialects. The tonal system of Mandarin has two components:


• The pitch level and shape of tones. Four Mandarin tones have been used since the 19th century (Figure 2). We denote tones with a diacritic over the vowel whose shape roughly matches the shape of the tone. The four tones are high level (tone 1, e.g., shuō), rising (tone 2, yú), falling-rising (tone 3, wǒ) and falling (tone 4, huài).

• The sandhi of tones. Some combinations of tones have difficult articulatory patterns, so words that might normally have one tone might take another depending on the context. For example, “nǐ” (you) and “hǎo” (good) are typically both third tone, but when they are together the pair is pronounced “ní hǎo” (hello), with the first syllable changing to a second tone. These changes are called sandhi (Xu, 1997; Hu, 2017).


Mandarin tones interact with a sung melody in two ways (Yinliu et al., 1983; Schellenberg, 2013) to ensure lyrics are intelligible. First, at a local level, the shape of the tones of individual syllables should be consistent with the musical notes they are matched with; for example, in “Love Island” (Figure 4), “shàng” in the blue block has the “falling” shape and the group of notes assigned to it also falls from an A to an E. Second, at a global level, the music's pitch contour should align with the tones of the corresponding syllables (taking sandhi into account). In practice, we align the transitions between successive syllables and successive notes (Figure 5), ensuring that the tone matches the relative pitch change (Schellenberg, 2013).


³ More examples at https://gagast.github.io/posts/gagast/#misunderstanding_examples


[Figure 4: score excerpt of “Love Island” with pinyin, Chinese characters, and English glosses under the notes. Annotations: REST: intervals of silence that usually align with word segmentations or punctuation; one character (syllable) aligns with a group of multiple notes; one character (syllable) aligns with a single note.]


Figure 4: The output of a song translation needs to align syllables to the reference melody. There are several options, as evinced by the song “Love Island (xīn dǎo)”. Orange (top): REST notes; Blue (bottom left): one syllable is assigned to a group of multiple notes (which needs tone shape alignment: the down arrow matches the falling tone of “shàng”); Green (bottom right): one syllable is assigned to one note.


3 AST for Tonal Languages 


This section formally defines automatic song translation (AST) for tonal languages and introduces three criteria for what makes a good song translation. These criteria form the foundation for the quantitative metrics we use in the experiments.


3.1 Criteria 


There are three criteria that a singable song transla- 
tion needs to fulfil. 


Preserve meaning. The translated lyrics should 
be faithful to the original source lyrics. 


Singability. Low (2003) defines singability as the phonetic compatibility of translated lyrics and music. The translated song needs to be sung without too much difficulty; difficult consonant clusters, cramming too many syllables into a line, or incompatible tones all impair the singability.


Intelligibility. The translated song needs to be understood by the listener. This quality has two components. First, could a listener produce any transcription of the lyrics? If the lyrics are too fast or garbled because the keywords do not fit well with the music, the lyrics are unintelligible. Beyond this basic test of recognizability, the lyrics must also be accurate: does this transcription match the intended meaning? Both aspects matter for a stage performance, since the audience should understand the content to follow the plot. For pop songs, not understanding all of the content could be acceptable for some audiences; for example, Adriano Celentano's Prisencolinensinainciusol sacrifices all intelligibility for singability (Bellos, 2013). However, in more traditional media, hilarious misheard lyrics can ruin the audience's experience (Figure 3).


notes        A3   C4   D4    REST  F4     G4   F4   F4
pitch level  57   60   62          65     67   65   65
duration     +    +    2     1     3      3    7    1
syllables    How  a-   bout        love?


Table 1: A snippet of the song “Seasons of love” from the musical Rent that shows the input into GagaST. Notes are converted into integer pitches with a duration, and syllables are aligned to notes: the “a” from “about” has one note but “love” has four.


3.2 Task Definition 


We define the AST task as follows: given an aligned pair of melody M and source lyrics X, generate translated text Y in the target language that aligns with the input melody M.


Specifically, $X = [x_1, \ldots, x_L]$ are the input lyrics with $L$ syllables. Each syllable $x_i$ is aligned to a snippet of the melody (Table 1) represented by a sequence of notes. To represent this to our algorithm, each syllable is aligned to three components of the melody:


1. A sequence of pitch values $p_i = [p_i^0, \ldots]$ with $|p_i| \ge 1$, where a difference of 1.0 corresponds to a semitone (e.g., between C and C-sharp).

2. The durations of those notes $d_i = [d_i^0, \ldots]$, where 1.0 is a quarter note. Because it encodes the duration of each note, the length of $d_i$ must be the same as the length of $p_i$.

3. Sometimes there is a rest (pause) before a lyric is sung. We align this to the following syllable $i$. The scalar $r_i$ is the real-valued duration of the REST note before note group $p_i$. If no REST exists before $p_i$, $r_i = 0.0$.
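The three components above can be pictured as one record per syllable. The following is a minimal sketch, not the authors' data format: the class name `AlignedSyllable` and its field names are hypothetical, the pitches are taken from Table 1, and every duration (including the rest length) is a placeholder because the duration row of Table 1 is not legible here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AlignedSyllable:
    """One source syllable x_i with its melody snippet (hypothetical container)."""
    text: str                 # the syllable x_i
    pitches: List[int]        # p_i: one integer pitch per note, 1 unit = 1 semitone
    durations: List[float]    # d_i: one duration per note, 1.0 = quarter note
    rest_before: float = 0.0  # r_i: duration of the REST preceding this note group

    def __post_init__(self) -> None:
        if len(self.pitches) < 1:
            raise ValueError("each syllable needs at least one note")
        if len(self.durations) != len(self.pitches):
            raise ValueError("d_i must have the same length as p_i")
        if self.rest_before < 0.0:
            raise ValueError("rest duration cannot be negative")


# Pitches follow Table 1; all durations and the rest length are placeholders.
line = [
    AlignedSyllable("How", [57], [1.0]),
    AlignedSyllable("a-", [60], [1.0]),
    AlignedSyllable("bout", [62], [1.0]),
    AlignedSyllable("love?", [65, 67, 65, 65], [1.0, 1.0, 1.0, 1.0], rest_before=1.0),
]
```

Attaching the rest to the syllable that follows it mirrors item 3 and keeps each record self-contained.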


3.3 Constraints for Aligning Lyrics to Music 


To make translated songs singable and intelligible, we summarize three desirable properties that the AST lyric outputs should have if they are to match the underlying melody. Each of these induces a score function, which we use both in our objective functions for constrained translation and in our evaluation metrics.


3.3.1 Length Alignment 


The number of syllables $L_y$ in the translated lyrics Y needs to match the number of note groups $p_i$ in the melody M, so that it can be sung with the music. Within the scope of this paper, we either keep the original grouping in the melody M and have $L_y = L_x$ (the number of source syllables) to reproduce the original music, or strictly produce one target syllable for each single note in the melody.


3.3.2 Pitch Alignment 


For tonal languages, the pitch of the music must match the tones of the lyrics. As in Section 2.2, there are two types of pitch alignment: 1) intra-syllable, where the tone shape of each syllable (Figure 4, blue box) should align with the shape of the assigned group of notes; 2) inter-syllable, where the overall pitch contour of the music phrase should align with the tones of the lyrics.


Intra-syllable alignment. For an individual syllable, if it is assigned to more than one note (e.g., “love” in Table 1), those notes must be consistent with the shape of the syllable's tone (Wee, 2007). For Mandarin, there are four tones (Xu, 1997, Figure 2). We estimate the shape of the multi-note sequence $p_i$ by least-squares estimation and classify it into one of five categories: level, rising, falling, rising-falling, falling-rising.


Specifically, for each group $p_i$ with $|p_i| > 1$, we classify it as

1. “level”, if $p_i^{\max} - p_i^{\min} < 1.0$; otherwise, we fit $p_i$ to $ax^2 + bx + c$ via least-squares estimation and compute the axis of symmetry $l = -b/2a$;

2. “rising”, if ($l < p_i^0$ and $a > 0.0$) or ($l > p_i^n$ and $a < 0.0$);

3. “falling”, if ($l < p_i^0$ and $a < 0.0$) or ($l > p_i^n$ and $a > 0.0$);

4. “rising-falling”, if $p_i^0 < l < p_i^n$ and $a < 0.0$;

5. “falling-rising”, if $p_i^0 < l < p_i^n$ and $a > 0.0$.

We compare the shape with that of syllable $y_i$, and compute the intra-syllable alignment score $S^i_{\text{intra}}$:

$$S^i_{\text{intra}} = \begin{cases} 1.0 & \text{if the shape matches,} \\ \epsilon & \text{otherwise,} \end{cases} \tag{1}$$

where $\epsilon$ is a small parameter that allows for mismatches. Of the five patterns, “level” can match with any tone, “rising” matches with tone 2 (yú), “falling” matches with tone 4 (huài), “falling-rising” matches with tone 3 (wǒ) while “rising-falling” matches no Chinese tones.
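The shape classification and Equation 1 can be sketched in a few lines. This is an illustrative reading, not the authors' implementation. It assumes notes sit at x = 0, 1, ..., n-1 and that the axis of symmetry is compared against the positions of the first and last notes. It also assumes that two-note groups and perfectly collinear groups (where the parabola degenerates) are classified by the sign of the slope, and that `eps=0.1` is a placeholder value.

```python
from typing import Sequence, Tuple


def _det3(m: Sequence[Sequence[float]]) -> float:
    """Determinant of a 3x3 matrix by cofactor expansion."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_quadratic(ys: Sequence[float]) -> Tuple[float, float, float]:
    """Least-squares fit of a*x^2 + b*x + c to the points (0, ys[0]), (1, ys[1]), ..."""
    xs = range(len(ys))
    s = [sum(x ** k for x in xs) for k in range(5)]                 # sums of x^0 .. x^4
    t = [sum((x ** k) * y for x, y in zip(xs, ys)) for k in range(3)]
    m = [[s[4], s[3], s[2]], [s[3], s[2], s[1]], [s[2], s[1], s[0]]]  # normal equations
    rhs = [t[2], t[1], t[0]]
    det = _det3(m)
    coeffs = []
    for col in range(3):  # Cramer's rule, one coefficient per column
        mc = [[rhs[r] if c == col else m[r][c] for c in range(3)] for r in range(3)]
        coeffs.append(_det3(mc) / det)
    return coeffs[0], coeffs[1], coeffs[2]


def classify_shape(pitches: Sequence[float]) -> str:
    """Classify a note group as level, rising, falling, rising-falling or falling-rising."""
    if max(pitches) - min(pitches) < 1.0:
        return "level"
    n = len(pitches)
    if n == 2:  # assumption: a parabola is underdetermined, so use the direction
        return "rising" if pitches[1] > pitches[0] else "falling"
    a, b, _ = fit_quadratic(pitches)
    if abs(a) < 1e-9:  # assumption: collinear notes, use the slope sign
        return "rising" if b > 0 else "falling"
    axis = -b / (2 * a)
    if 0.0 < axis < n - 1:
        return "rising-falling" if a < 0 else "falling-rising"
    if axis <= 0.0:
        return "rising" if a > 0 else "falling"
    return "rising" if a < 0 else "falling"


# Tone that each non-level shape matches; "rising-falling" matches no tone.
TONE_FOR_SHAPE = {"rising": 2, "falling": 4, "falling-rising": 3}


def intra_score(pitches: Sequence[float], tone: int, eps: float = 0.1) -> float:
    """Equation 1: 1.0 if the note-group shape matches the syllable's tone, else eps."""
    shape = classify_shape(pitches)
    if shape == "level":
        return 1.0
    return 1.0 if TONE_FOR_SHAPE.get(shape) == tone else eps
```

Under these assumptions the four notes of “love” in Table 1 (65, 67, 65, 65) come out as rising-falling, so no Mandarin tone would score 1.0 on that group.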
[Figure 5: table of acceptable note transitions for each pair of previous tone (rows) and next tone (columns), with tones 1st: fēng, 2nd: huáng, 3rd: wǒ, 4th: shì. Transition types: Jump Down, Step Down, Level, Step Up, Jump Up.]


Figure 5: For translated songs in Mandarin to be singable, music notes should align with the tones of successive characters; this becomes our inter-syllable pitch alignment. The arrows show acceptable transitions in music for two successive Mandarin characters $(w_{i-1}, w_i)$ based on the shape of Mandarin tones including sandhi.


Inter-syllable alignment. The second constraint compares the transition directions between consecutive tones $(t_{i-1}, t_i)$ of successive syllables $(y_{i-1}, y_i)$ that belong to the same word (see arrows in Figure 3). These must match the transition directions of music notes $(p_{i-1}, p_i)$.⁴ Each transition (the movement from one syllable/note to the next) can be categorized as level, step up, jump up, step down and jump down. We summarize the acceptable transitions for each pair of successive syllables in Figure 5 based on analysis by Yinliu et al. (1983), and we discuss our choices in more detail in Appendix A.2. Given two syllables $(y_{i-1}, y_i)$, we compute the local pitch contour score $S^i_{\text{inter}}$:

$$S^i_{\text{inter}} = \begin{cases} 1.0 & \text{if the contour matches,} \\ \epsilon & \text{otherwise,} \end{cases} \tag{2}$$

where $\epsilon$ again is a small value to allow mismatches.
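A rough sketch of Equation 2 follows. Several parts are assumptions rather than the paper's settings: the step/jump boundary of two semitones (borrowed from the usual musical notion of a step), the value `eps=0.1`, and above all the `EXAMPLE_ACCEPTABLE` table, which is a made-up placeholder and not the content of Figure 5. Directions are taken from the first note of each group, as the paper's footnote describes.

```python
from typing import Dict, Sequence, Set, Tuple


def transition(prev_notes: Sequence[int], cur_notes: Sequence[int],
               jump_threshold: int = 2) -> str:
    """Categorize the move between two note groups, using their first notes only."""
    diff = cur_notes[0] - prev_notes[0]
    if diff == 0:
        return "level"
    size = "step" if abs(diff) <= jump_threshold else "jump"  # threshold is an assumption
    return f"{size} {'up' if diff > 0 else 'down'}"


def inter_score(prev_tone: int, cur_tone: int,
                prev_notes: Sequence[int], cur_notes: Sequence[int],
                acceptable: Dict[Tuple[int, int], Set[str]],
                eps: float = 0.1) -> float:
    """Equation 2: 1.0 if the note transition is acceptable for the tone pair, else eps."""
    allowed = acceptable[(prev_tone, cur_tone)]
    return 1.0 if transition(prev_notes, cur_notes) in allowed else eps


# Placeholder table for illustration only; the real entries come from Figure 5.
EXAMPLE_ACCEPTABLE: Dict[Tuple[int, int], Set[str]] = {
    (4, 4): {"level", "step down", "jump down"},
}
```

Passing the table in as an argument keeps the scoring logic separate from the linguistic analysis that decides which transitions are acceptable.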


3.3.3 Rhythmic Alignment with Word 
Segmentation in Mandarin 


A musical REST is a silence separating music. Recall that in our setup of the data, a scalar $r_i$ denotes whether (and for how long) a rest precedes syllable $i$. In any language, it is uncommon for a rest to break up a word's syllables. Thus a good translation should avoid this. For Mandarin, creating metrics that capture this is slightly more complicated because translation systems typically do not explicitly generate word boundaries. Thus, we must rely on the output of segmentation systems to know where word boundaries are.


⁴ We compute the directions of two note groups $(p_{i-1}, p_i)$ by their first notes $(p_{i-1}^0, p_i^0)$ for simplicity.


An exception to this is punctuation (Figure 4). If a comma, period, or other punctuation is attached to the previous syllable $y_{i-1}$, then that is a clear signal that it is fine to pause between them. Thus, for a syllable $y_i$ following $y_{i-1}$, where the two are part of different words with probability $P_{\text{seg}}$,⁵ the rest score is:

$$S^i_R = \begin{cases} 1.0 & \text{if } r_i > 0.0 \text{ and [punc] after } y_{i-1}, \\ 1.0 & \text{if } r_i = 0.0, \\ P_{\text{seg}} & \text{if } r_i > 0.0 \text{ and not [punc]}, \\ \epsilon & \text{otherwise,} \end{cases} \tag{3}$$

where $\epsilon$ is a parameter that represents our tolerance of having a rest within a word.
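Equation 3 can be sketched for the case of a hard word segmenter. The assumption here is that $P_{\text{seg}}$ collapses to 1.0 at a word boundary and to $\epsilon$ inside a word; the paper does not spell this out. The segmentation is taken as given (a list of words, one Chinese character per syllable), the example segmentation is hypothetical, and `eps=0.1` is a placeholder.

```python
from typing import List


def starts_word(words: List[str]) -> List[bool]:
    """Flag, for every syllable (one character each), whether it begins a word."""
    flags: List[bool] = []
    for word in words:
        flags.extend(i == 0 for i in range(len(word)))
    return flags


def rest_score(rest_before: float, punc_after_prev: bool,
               word_boundary: bool, eps: float = 0.1) -> float:
    """Rest score for syllable y_i under a hard segmentation."""
    if rest_before <= 0.0:
        return 1.0   # no rest before this syllable: nothing to penalize
    if punc_after_prev:
        return 1.0   # punctuation after y_{i-1} licenses the pause
    return 1.0 if word_boundary else eps  # a rest inside a word is tolerated only by eps


# Hypothetical segmentation of "wǒ yǐ wàng jì" into three words.
flags = starts_word(["我", "已", "忘记"])
```

With this segmentation, a rest placed before the last syllable would fall inside a word and receive `eps`, while a rest before any of the first three syllables would score 1.0.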


4 GagaST 


Ideally, we would build an AST system for English- 
Mandarin song translation with data-driven mod- 
els from parallel data, i.e., aligned triples (M, X, 
Y). However, these data are not available in the 
quantity or quality necessary for Mandarin: there 
is not enough data of any quality, and those that 
do exist have errors in the syllable-notes align- 
ment. Thus, we propose an unsupervised AST 
system, Guided AliGnment for Automatic Song 
Translation (GagaST). For the pre-training, we col- 
lect non-parallel lyrics data in both English and 
Mandarin, as well as a small set of lyrics transla- 
tion data (Section 5.1). 


4.1 Song-Text Style Translation 


To produce faithful translations in song-text style, we pre-train a transformer-based translation model with cross-domain data: translation data in the general domain, the collected monolingual lyrics data, and a small set of lyric translation data. We prepend domain tags (Figure 6) to each input example so that the model produces translations only in the lyrics domain during song translation. For monolingual lyrics data, we adopt BART pre-training (Lewis et al., 2020).


4.2 Music Guided Alignment Constraints 


Without available parallel data to learn the lyric-melody alignments, we impose constraints (Section 3.3) in the decoding phase. Specifically, since all constraints are applied at the unigram (intra-syllable, REST) or bigram (inter-syllable, REST) level, we apply them at each step of beam search as rewards and penalties in the scoring function:

$$\log P(Y \mid X, M) = \sum_{i=0}^{L} \big[ \log P(y_i \mid y_{i-1:0}, X) + \lambda_{\text{inter}} \log S^i_{\text{inter}} + \lambda_{\text{intra}} \log S^i_{\text{intra}} + \lambda_R \log S^i_R \big], \tag{4}$$

where $S^i_{\text{inter}}$, $S^i_{\text{intra}}$, and $S^i_R$ refer to the alignment scores for inter-syllable pitch alignment, intra-syllable pitch alignment and the rhythm alignment by REST. We introduce three tunable parameters, $\lambda_{\text{inter}}$, $\lambda_{\text{intra}}$, and $\lambda_R$, that control the importance of each of the song-specific constraints.


⁵ In practice, we use the cut output by the Jieba toolkit.


[Figure 6: two-panel diagram. Step 1: Pretraining on three data types (General Translation, NonParallel Lyrics, Lyrics Translation), with each input formatted as ([lang tag] [domain tag] [length tag] input texts ...), e.g., “[2zh] [GEN] [LEN9] Even the United States ...”. Step 2: Inference, where the notes and their pitches add constraints in beam search; each beam is shown with its pitch alignment score with and without constraints. Scores in the figure are not exact, merely for illustration.]


Figure 6: Overview of GagaST for English-Mandarin song translation. We first pre-train a lyrics translation model with mixed-domain data (left), and then add alignment constraints to the decoding scoring function during inference (right). We use the unconstrained version as our baseline in the experiments.
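The per-step term of Equation 4 can be sketched as follows. The function names and the numbers in the example are illustrative only; in particular the weights passed in are placeholders, not the tuned values.

```python
import math
from typing import Iterable, Tuple


def step_score(log_p_token: float, s_inter: float, s_intra: float, s_rest: float,
               lam_inter: float, lam_intra: float, lam_rest: float) -> float:
    """One summand of Equation 4: model log-probability plus weighted log alignment scores."""
    return (log_p_token
            + lam_inter * math.log(s_inter)
            + lam_intra * math.log(s_intra)
            + lam_rest * math.log(s_rest))


def sequence_score(steps: Iterable[Tuple[float, float, float, float]],
                   lam_inter: float, lam_intra: float, lam_rest: float) -> float:
    """Sum of step scores over (log_p, s_inter, s_intra, s_rest) tuples."""
    return sum(step_score(lp, si, sa, sr, lam_inter, lam_intra, lam_rest)
               for lp, si, sa, sr in steps)


# Two candidate next syllables: A is likelier under the model but breaks the
# inter-syllable contour (score eps = 0.1); B is less likely but matches it.
cand_a = step_score(-0.5, 0.1, 1.0, 1.0, lam_inter=1.0, lam_intra=1.0, lam_rest=1.0)
cand_b = step_score(-1.5, 1.0, 1.0, 1.0, lam_inter=1.0, lam_intra=1.0, lam_rest=1.0)
```

Because a matching alignment has score 1.0 and log 1.0 = 0, a satisfied constraint leaves the model score untouched, while each mismatch subtracts λ·|log ε|. In this example candidate B therefore outranks candidate A.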


4.3 Length Control in Pre-training 


To meet the length constraints, we pre-define the syllable-notes assignments with two strategies: 1) note-to-syllable, i.e., for each note, we produce one syllable; 2) syllable-to-syllable, i.e., we use the original note grouping in the input melody and assign one syllable to each note group. In either case, the length of the target translation is known. Following Lakew et al. (2019), we use the length tag “[LEN$i]” to control the length of outputs during pre-training, where $i refers to the length of the target sequence.


⁶ A dynamic mapping between the note sequence and the syllables changes the original rhythm and increases the search space exponentially. We leave this to future work.
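The two assignment strategies and the tagged input format can be sketched as below. The tag strings follow the pattern shown in Figure 6 ([lang tag] [domain tag] [length tag] input texts); the helper names and the exact spacing between tags are assumptions.

```python
from typing import List


def target_length(note_groups: List[List[int]], strategy: str) -> int:
    """Number of target syllables implied by the melody under each strategy."""
    if strategy == "syllable-to-syllable":
        return len(note_groups)                   # one syllable per original note group
    if strategy == "note-to-syllable":
        return sum(len(g) for g in note_groups)   # one syllable per single note
    raise ValueError(f"unknown strategy: {strategy}")


def tag_input(text: str, lang_tag: str, domain_tag: str, length: int) -> str:
    """Prefix the source text with language, domain and length tags."""
    return f"[{lang_tag}] [{domain_tag}] [LEN{length}] {text}"


# Note groups of the Table 1 snippet: "How", "a-", "bout", "love?".
groups = [[57], [60], [62], [65, 67, 65, 65]]
tagged = tag_input("How about love?", "2zh", "LYRICS",
                   target_length(groups, "syllable-to-syllable"))
```

For this snippet the syllable-to-syllable strategy asks for four target syllables and the note-to-syllable strategy for seven, which is why the length tag has to be computed from the melody rather than from the source text.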


5 Generating Melody-constrained Lyrics 
and Validating Singability 


This section details data sets, model configuration, 
and proposed evaluation metrics. Then we ana- 
lyze the results and the trade-offs inherent in song 
translation. Our code and data are open-sourced at 
https://github.com/GagaST. 


5.1 Training Datasets and Model 
Configuration 


WMT dataset: news commentary and back-translated news datasets from WMT14 (29.6 million en2zh sentence pairs). No Cantonese texts are included, and the official Chinese texts can be pronounced in Mandarin by default.


Monolingual lyrics data: monolingual lyrics in 
both Mandarin and English collected from the web 
(12.4 million lines of lyrics for Mandarin and 109.5 
million for English after removing duplicates). 


Lyrics translation data: a small set of lyrics translation data crawled from the web⁷ (140 thousand pairs of English-to-Mandarin lines). These translations are not singable.


We preprocess all training data with fastBPE (Sennrich et al., 2016) and a code size of 50,000. We use an encoder-decoder Transformer (Vaswani et al., 2017) with 768 hidden units, 12 heads, GELU activation, a maximum input length of 512, and a 12-12 layer structure (see Appendix B for more details).


⁷ https://lyricstranslate.com/
5.2 Evaluation Datasets


For evaluation, we need aligned triples (melody M, source lyrics X, target reference lyrics Y), where two conditions hold: 1) M and X are syllable-to-note aligned; 2) the reference Y should be singable and intelligible. With the confluence of digitization and copyright making such resources rare, we choose fifty songs from the lyrics translation dataset that have open-source music sheets on the web and create aligned triples manually. However, because the reference lyrics in this dataset are not singable (our primary goal!), we use them only to validate that the translations preserve the original meaning. Twenty songs comprise the validation set (464 lines) and thirty songs comprise the test set (713 lines).


5.3 Evaluation Metrics 


An AST system for tonal languages should generate translated songs that are singable and intelligible while conveying the original meaning. Evaluating such a system is an intrinsically hard task since all three qualities can be qualitative. Especially for preserving meaning, the lack of gold references and the greater tolerance for a loose translation in songs make it difficult to say how much semantic divergence is acceptable. Therefore, we first establish evaluations based on the relationship between lyrics and music and then design human annotations for more qualitative evaluations.


5.3.1 Objective Evaluation 


Section 3.3 outlines three constraints inspired by music and linguistic theory. Because these constraints are directly incorporated into the decoding objective (Equation 4), the constrained system's alignment scores will be better than those of an unconstrained translation. However, we want to understand the trade-off between these new objectives and traditional translation evaluations.


To control for the length of the sentence, we normalize each score to 0-1.0 by the number of alignment pairs $L_k$; that is, based on Equations 1, 2 and 3,

$$S_k = \frac{1}{L_k} \sum_{i=1}^{L_k} S^i_k, \quad k \in \{\text{intra}, \text{inter}, R\}. \tag{5}$$

For the length constraint, we compute: 1) $N_l$, the number of samples that are longer than the predefined length; 2) $N_s$, the number of samples that are shorter than the predefined length. For each case we compute the average error ratio, i.e., the length difference relative to the predefined length. For meaning, although we lack gold singable translations, we follow the common practice and calculate BLEU (Papineni et al., 2002) between the translated songs and the prose translation.
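The normalization and the length statistics can be sketched as follows. The function names are hypothetical, and the error ratio is assumed to be the absolute length difference divided by the predefined length, in line with how Table 2 reports a count followed by an average ratio.

```python
from statistics import mean
from typing import Optional, Sequence, Tuple


def normalized_score(pair_scores: Sequence[float]) -> Optional[float]:
    """Average of per-pair alignment scores for one line; None if nothing was scored."""
    if not pair_scores:
        return None
    return sum(pair_scores) / len(pair_scores)


def length_errors(lengths: Sequence[int], targets: Sequence[int]
                  ) -> Tuple[Tuple[int, float], Tuple[int, float]]:
    """Return (count, average error ratio) for too-long and for too-short outputs."""
    longer = [(n - t) / t for n, t in zip(lengths, targets) if n > t]
    shorter = [(t - n) / t for n, t in zip(lengths, targets) if n < t]
    return ((len(longer), mean(longer) if longer else 0.0),
            (len(shorter), mean(shorter) if shorter else 0.0))


# Three matching pairs and one mismatch scored with eps = 0.1 (placeholder value).
line_score = normalized_score([1.0, 0.1, 1.0, 1.0])
# Three output lines against their predefined lengths.
long_stats, short_stats = length_errors([5, 4, 6], [4, 4, 5])
```

Returning the count and the average ratio together makes it easy to print entries in the “9 (0.09)” style used in Table 2.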


5.4 Trade-offs between Meaning and 
Melody-lyric Alignments 


GagaST adds constraints in the decoding scoring function to enforce lyric-music alignments; however, there are trade-offs between preserving meaning and adhering to these constraints. To select the importance of each constraint in decoding, we vary the value of the corresponding parameter $\lambda$ (Equation 4) and analyze how much the BLEU score falls on the validation set as we increase the influence of the parameter. We set the hyper-parameters where the alignment scores increase fast while the BLEU decreases slowly. The REST constraint does not affect the BLEU (Table 2) but does alter the amount of punctuation. Working off the assumption that excessive punctuation is bad, we select a parameter that minimizes the mismatches between
the REST and word boundaries. We choose (Fig- 
ure 7) Ainter = 9-53 Aintra = 1-0; AR = 1.5 for 
all subsequent experiments. 


5.4.1 BLEU Evaluation 


Table 2 compares GagaST as we ablate constraints with our two syllable-to-note alignment strategies (Section 4.3): note-to-syllable and syllable-to-syllable. As in previous work, the length tag “[LEN$i]” helps lyrics fit the notes available. In all cases, fewer than 30 out of 713 lines are longer than the predefined length, with an average error ratio below 0.22, and none are shorter. Thus, because it most closely resembles prior work in controlled translation and works well in this task, we adopt GagaST with only length tags and no other constraints as our baseline. With all of the constraints, GagaST indeed improves both pitch and rhythm alignment. It almost doubles the pitch contour alignment score, which affects intelligibility the most.


However, these gains come at the cost of BLEU 
score. While we believe that the audience would 
be more accepting of a less-than-literal translation 
in a song if it sounds better, we need a qualitative 
evaluation to validate that hypothesis. 


Due to the paucity of reliable references, BLEU scores do not correlate with human judgement. For example, three official Disney Mandarin song translations have a lower BLEU score (12.3) than our more literal but demonstrably worse automatic translations.
Figure 7: Trade-off between meaning (y-axis) and lyric-music alignments (x-axis) while adjusting the tuning parameter λ on the validation set. The selected value of λ for downstream experiments is shown in red (preceded by λ =). REST constraints do not affect BLEU but increase the number of [punc]s, which impairs the fluency of the lyrics; thus we select its parameter based on the number of [punc]s.


Syllable-notes        Model                    Pitch            Rhythm                   Length                Meaning
Assignment                                     inter↑  intra↑   avg # of missed rests↓   longer↓    shorter↓   BLEU↑
note-to-syllable      GagaST w/o constraints   0.28    -        0.53                     9 (0.09)   0          24.0
                      GagaST                   0.51    -        0.31                     26 (0.21)  0          16.9
                      - only inter-syllable    0.51    -        0.45                     26 (0.21)  0          16.8
                      - only rest              0.28    -        0.31                     11 (0.09)  0          23.8
syllable-to-syllable  GagaST w/o constraints   0.29    0.49     0.62                     4 (0.12)   0          22.1
                      GagaST                   0.50    0.55     0.28                     13 (0.13)  0          15.9
                      - only inter-syllable    0.51    0.50     0.42                     7 (0.12)   0          15.8
                      - only intra-syllable    0.29    0.56     0.44                     4 (0.12)   0          21.6
                      - only rest              0.29    0.49     0.28                     5 (0.12)   0          21.6


Table 2: Our song-specific constraints with two syllable alignment techniques. All results here use the same pre-training checkpoint, and length tags are applied. For the length score, 9 (0.09) means that 9 out of 713 samples are longer than the predefined length with an average ratio of 0.09. All constraints have an effect, but inter-syllable pitch alignment has the largest.


5.4.2 Qualitative Evaluation 


The true test of whether AST works is whether the songs can be sung, understood, and enjoyed. Thus, we follow Sheng et al. (2021) and show annotators, who are students from a music school, the resulting sheet music, ask their opinion, and ask them to sing the songs. We randomly select five songs from the test set and show the music sheets (see Appendix C) of the first ten sentences of each translated song to five annotators.


Following the mean opinion score (Rec, 1994, MOS) in speech synthesis, we use five-point Likert scales (1 for bad and 5 for excellent). We evaluate the songs on four dimensions: 1) sense, fidelity to the meaning of the source lyric; 2) style, whether the translated lyric resembles song-text style; 3) listenability, whether the translated lyric sounds melodious with the given melody; 4) intelligibility, whether the audience can easily comprehend the translated lyrics if sung with the provided melody. The last two dimensions require the annotators to sing the song.


Model            Song     sense     style     listenability  intelligibility

GagaST           Song1    3.4       3.0       3.2            3.4
w/o constraints  Song2    3.6       3.9       3.4            3.8
                 Song3    3.7       3.6       3.4            3.5
                 Song4    3.2       3.0       2.8            3.0
                 Song5    3.7       3.6       3.4            3.8
                 Average  3.5±0.14  3.4±0.14  3.2±0.12       3.5±0.13

GagaST           Song1    3.5       3.1       3.3            3.5
                 Song2    3.4       3.7       3.5            4.0
                 Song3    3.2       3.6       3.3            3.6
                 Song4    2.9       3.0       3.1            3.5
                 Song5    3.4       3.6       3.2            3.9
                 Average  3.3±0.15  3.4±0.15  3.3±0.12       3.7±0.13


Table 3: Qualitative evaluation results for GagaST w/o 
constraints and GagaST. 
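As a sanity check on Table 3, the short sketch below recomputes the per-dimension averages from the per-song scores as read from the table. The variable and function names are illustrative. The ± values are not reproduced, since they depend on per-annotator ratings that the table does not list.

```python
from statistics import mean

DIMENSIONS = ("sense", "style", "listenability", "intelligibility")

# Per-song scores (Song1..Song5) as read from Table 3.
SCORES = {
    "GagaST w/o constraints": [
        (3.4, 3.0, 3.2, 3.4),
        (3.6, 3.9, 3.4, 3.8),
        (3.7, 3.6, 3.4, 3.5),
        (3.2, 3.0, 2.8, 3.0),
        (3.7, 3.6, 3.4, 3.8),
    ],
    "GagaST": [
        (3.5, 3.1, 3.3, 3.5),
        (3.4, 3.7, 3.5, 4.0),
        (3.2, 3.6, 3.3, 3.6),
        (2.9, 3.0, 3.1, 3.5),
        (3.4, 3.6, 3.2, 3.9),
    ],
}


def column_means(rows):
    """Average each dimension over the songs, rounded to one decimal."""
    return {
        dim: round(mean(col), 1)
        for dim, col in zip(DIMENSIONS, zip(*rows))
    }


for model, rows in SCORES.items():
    print(model, column_means(rows))
```

The recomputed means agree with the "Average" rows of the table at one decimal place.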


5.4.3 Qualitative Evaluation Results 


To examine whether the proposed constraints improve
singability and intelligibility, our qualitative
evaluation compares GagaST with only length
constraints to fully constrained GagaST with
syllable-to-syllable assignment (Table 3). The
constraints improve intelligibility (from 3.5 to
3.7 on average) and slightly improve singability,
measured as listenability (from 3.2 to 3.3), but
they make it harder for the original meaning to
come through (sense drops from 3.5 to 3.3).
Overall, the annotators are satisfied with the
songs translated by the proposed baseline GagaST:
all aspects receive an average score of around
3.5 out of 5. These case studies, along with three
songs translated by GagaST and sung by an amateur
singer, are available at
https://gagast.github.io/posts/gagast.


6 Related Work 


Verse Generation and Translation. Verse generation
began with rule-based implementations (Milic, 1970)
and developed over the next forty years. Manurung
(1999) design a chart system that generates strings
matching a given stress pattern. Gervas (2000) build
a forward-reasoning rule-based system. Manurung
(2004) address poetry generation with stochastic
search based on evolutionary algorithms. Oliveira
(2012) create a template-based platform that allows
users to define features and create templates. He
et al. (2012) adopt statistical machine translation
models for Chinese poetry generation. Yan et al.
(2013) compose poetry with a generative
summarization framework. Zhang and Lapata (2014),
Wang et al. (2016), and Hopkins and Kiela (2017)
adopt recurrent neural networks for poetry
generation and incorporate rhythmic constraints.
Ghazvininejad et al. (2016, 2017) represent rhythm
and rhyme with finite-state machines. Poetry
translation using these frameworks and statistical
machine translation thus offers elegant solutions:
Genzel et al. (2010) intersect the finite-state
representation of the meter and rhyme scheme with
the synchronous context-free grammar of the
translation model under the phrase-based machine
translation framework. Ghazvininejad et al. (2018)
apply the finite-state constraints to a neural
translation model. However, these representations
of rhythmic and lexical constraints are not flexible
enough to encode the real-valued representation of
a song, as required for translation into tonal
languages.


Constrained Text Generation. Most natural language
generation tasks, including machine translation
(Bahdanau et al., 2015; Vaswani et al., 2017;
Hassan et al., 2018), dialogue systems (Shang et al.,
2015; Li et al., 2016; Wang et al., 2021) and
abstractive summarization (Rush et al., 2015;
Paulus et al., 2018), are free text generation.
However, some tasks need text generated under
constraints (Lakew et al., 2019; Li et al., 2020;
Zou et al., 2021). Hokamp and Liu (2017), Post and
Vilar (2018), and Hu et al. (2019) constrain beam
search with a dictionary. At training time, Li et al.
(2020) add format embeddings and Lakew et al. (2019)
introduce length tags. Saboo and Baumann (2019)
address length control by rescoring the results of
beam search for machine translation under dubbing
constraints.


Lyrics Generation. As one of the most important
tasks in automatic songwriting, lyrics generation
has received more attention recently. Sheng et al.
(2021), Lee et al. (2019) and Chen and Lerch (2020)
generate lyrics with purely data-driven models,
without adding constraints based on expert
knowledge. Oliveira et al. (2007) build a rule-based
lyrics generation system that handles rhyme and
rhythm with designed heuristics. Malmi et al. (2016)
address rap lyrics generation with an
information-retrieval approach and propose a
rhyme-density measure. Watanabe et al. (2018) add
conditions to a standard RNNLM with a featurized
input melody for rhythmic alignment. Ma et al.
(2021) develop a SeqGAN-based lyrics generator to
address various properties, such as rhythmic
alignment, theme and genre. Xue et al. (2021) use a
transformer-based model to generate rap lyrics in
reverse order, address rhymes with vowel embeddings
and add extra beat tokens for rhythmic alignment.
Ours is the first paper to formally address the
importance of aligning melody pitch with language
tones in lyrics generation for tonal languages. We
introduce two vital qualities of songs, singability
and intelligibility, and design three types of
melody-lyric alignment scores to improve the two
qualities.


7 Conclusion 


This paper addresses automatic song translation
(AST) for tonal languages and the unique challenge
of aligning words’ tones with the melody. We build
GagaST, the first English-Mandarin AST system. Both
objective and subjective evaluations demonstrate
that GagaST improves the singability and
intelligibility of translated songs.


Further constraints, such as rhyme and style, are
left to future work. We aim to build a systematic
framework that addresses all constraints. With the
help of newly developed singing voice synthesis
tools such as X Studio,* we can perform human
evaluation with actual singing voices at a larger
scale to provide more reliable analysis. Moreover,
our system can also be applied to lyrics and song
generation applications without translation input.


* https://singer.xiaoice.com
Ethical Considerations


GagaST improves the singability and intelligibility
of translated songs in Mandarin by constraining the
decoding of a pretrained lyrics translation model.
This methodology is limited in that it imposes a
direct trade-off between the original translation
objective and the constraints. In terms of negative
impact or risks, inaccurate translations may cause
misunderstandings in applications such as musical
theatre.


This paper collects lyrics data that are publicly 
available and are parsed from the Web. We use 
these data for research purposes only. To pre- 
vent any abuse or piracy of these data, we chose 
the dataset license Attribution-NonCommercial- 
ShareAlike 4.0 International (CC BY-NC-SA 4.0). 
Appendix A


A.1 Illustration of Tonal Alignment by Frequency


Translating songs into tonal languages faces a
unique challenge, i.e., the tones of the translated
lyrics should align with the music pitch for
singability and intelligibility (Section 2.1).
Figure 8 provides a visual illustration of the main
problem.


To help researchers who speak non-tonal languages
better understand how the tones of lyrics in tonal
languages should align with the music or sung voice,
we record both the sung and the spoken voice for a
line of lyrics from one of the most popular songs in
Mandarin, transform the sound into the frequency
domain, and compare the shape of the sound with
that of the music in Figure 9. The music of the
chosen song comes from an American song, “Dreaming
of Home and Mother”, whose lyrics were rewritten in
Mandarin. Although this is a rewriting rather than
a translation, and so does not have to convey the
original meaning, we can see how the tonal contour
of the Mandarin lyrics aligns with that of the
music.


A.2 Acceptable Pitch Transition Directions Table


In Section 2.2, we explain that, in practice, the
relative pitch of the tones of successive
syllables/characters that belong to the same word
has the greatest effect on singability and
intelligibility. We summarize the acceptable
transition directions in Figure 5 under the
assumption that only the relative relationship of
successive notes matters. Note that we intend merely
to provide a workable solution, not a perfect one.
For example, handling the fourth tone of Mandarin is
tricky: it is a continuous fall over a large range
(see Figure 2), so it does not correspond to a
single note. If it is represented by one note, that
note might represent the onset or offset of the
tone, with the falling trend hinted at by the pitch
contour formed with the preceding and/or following
note (Zhuang, 1982; Yu, 2021).
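The idea of checking relative pitch directions against a table of acceptable transitions can be sketched as below. This is only an illustration under stated assumptions: the function names are hypothetical, pitches are assumed to be MIDI note numbers, and the table entries are arbitrary placeholders chosen to exercise the code. They are not the entries of Figure 5.

```python
from typing import Dict, FrozenSet, Sequence, Tuple

# Maps a pair of successive tones to the melodic directions accepted for it.
TransitionTable = Dict[Tuple[int, int], FrozenSet[str]]


def direction(prev_pitch: int, next_pitch: int) -> str:
    """Relative direction between two successive notes."""
    if next_pitch > prev_pitch:
        return "up"
    if next_pitch < prev_pitch:
        return "down"
    return "level"


def acceptable_transition_rate(tones: Sequence[int],
                               pitches: Sequence[int],
                               table: TransitionTable) -> float:
    """Fraction of successive syllable pairs within one word whose melodic
    direction is listed as acceptable. Tone pairs missing from the table
    are skipped (an assumption of this sketch)."""
    if len(tones) != len(pitches):
        raise ValueError("need exactly one pitch per syllable")
    checked = accepted = 0
    for i in range(len(tones) - 1):
        allowed = table.get((tones[i], tones[i + 1]))
        if allowed is None:
            continue
        checked += 1
        if direction(pitches[i], pitches[i + 1]) in allowed:
            accepted += 1
    return accepted / checked if checked else 1.0


# PLACEHOLDER table, NOT taken from Figure 5.
toy_table: TransitionTable = {
    (4, 3): frozenset({"down"}),
    (3, 1): frozenset({"up"}),
}

# A three-syllable word with tones 4-3-1 sung on the notes G4, E4, E4.
rate = acceptable_transition_rate([4, 3, 1], [67, 64, 64], toy_table)
print(rate)
```

Only the sign of the pitch difference enters the check, which mirrors the assumption above that only the relative relationship of successive notes matters.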


Appendix B 
B.1 Training Details 


We pretrain our transformer-based model with a
reconstruction objective, corrupting the input
sequence with text infilling (Lewis et al., 2020).
Detailed pretraining hyper-parameters are listed
in Table 4.
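To make the corruption step concrete, here is a simplified sketch of text infilling in the style of Lewis et al. (2020), using the values from Table 4 (mask rate 0.3, Poisson lambda 3.5, replace length 1). It is not the authors' implementation: the function names and the `<mask>` symbol are assumptions, zero-length spans are skipped, and spans are kept from touching each other for simplicity.

```python
import math
import random
from typing import List

MASK = "<mask>"  # assumed mask symbol


def sample_poisson(lam: float, rng: random.Random) -> int:
    """Draw from a Poisson distribution with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def text_infilling(tokens: List[str], mask_rate: float = 0.3,
                   poisson_lambda: float = 3.5, seed: int = 0) -> List[str]:
    """Replace random spans with a single MASK each (replace length 1)
    until roughly mask_rate of the tokens have been removed."""
    rng = random.Random(seed)
    n = len(tokens)
    budget = math.ceil(n * mask_rate)  # number of tokens still to mask
    masked = [False] * n
    span_starts = set()
    attempts = 0
    while budget > 0 and attempts < 10 * n:
        attempts += 1
        length = min(sample_poisson(poisson_lambda, rng), budget)
        if length == 0:
            continue  # simplification: ignore zero-length spans
        start = rng.randrange(0, n - length + 1)
        # Reject spans that overlap or touch an already masked span.
        lo, hi = max(0, start - 1), min(n, start + length + 1)
        if any(masked[lo:hi]):
            continue
        for i in range(start, start + length):
            masked[i] = True
        span_starts.add(start)
        budget -= length
    corrupted = []
    for i, token in enumerate(tokens):
        if not masked[i]:
            corrupted.append(token)
        elif i in span_starts:
            corrupted.append(MASK)  # one mask stands in for the whole span
    return corrupted


print(text_infilling("let it go let it go can't hold it back anymore".split()))
```

Because each span collapses to a single mask, the model has to predict how many tokens are missing as well as what they are.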


Parameter               Value
encoder layer           12
decoder layer           12
max source position     512
max target position     512
layernorm embedding     True
criterion               label smoothed cross entropy
learning rate           3e-4
label smoothing         0.2
min lr                  1e-9
lr scheduler            inverse sqrt
warmup updates          4000
warmup initial lr       1e-7
optimizer               adam
adam epsilon            1e-6
adam betas              (0.9, 0.98)
weight decay            0.01
dropout                 0.1
attention dropout       0.1
text infilling
  mask rate             0.3
  poisson lambda        3.5
  replace length        1


Table 4: Pretraining hyper-parameters 
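The learning-rate settings in Table 4 can be read as an inverse-square-root schedule with linear warmup. The sketch below assumes the common formulation (linear increase from the warmup initial lr to the peak learning rate, then decay proportional to the inverse square root of the update number). Treating min lr as a floor is an assumption of this sketch. Some toolkits use it as a stopping threshold instead.

```python
def inverse_sqrt_lr(step: int,
                    peak_lr: float = 3e-4,
                    warmup_updates: int = 4000,
                    warmup_init_lr: float = 1e-7,
                    min_lr: float = 1e-9) -> float:
    """Learning rate at a given update under the assumed schedule."""
    if step < warmup_updates:
        # Linear warmup from warmup_init_lr up to peak_lr.
        lr = warmup_init_lr + (peak_lr - warmup_init_lr) * step / warmup_updates
    else:
        # Decay proportional to 1 / sqrt(step), continuous at the warmup end.
        lr = peak_lr * (warmup_updates / step) ** 0.5
    return max(lr, min_lr)  # ASSUMPTION: min lr acts as a floor


for step in (0, 2000, 4000, 16000, 64000):
    print(step, inverse_sqrt_lr(step))
```

With these values the rate peaks at 3e-4 after 4000 updates and halves each time the update count quadruples.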


Appendix C 


C.1 Human Evaluation Instruction 


In this paper, we conduct subjective evaluations 
by collecting annotations about the qualities of the 
translated songs from music school students (Sec- 
tion 5.4.2). 


C.2 Music Sheets 


As described in Section 5.4.2 and shown in the
instructions (Figure 10), we distribute music sheets
of the translated songs to the annotators. All music
sheets can be found at
https://gagast.github.io/posts/human_eval.
[Figure 8 image, three rows. Top: melody pitch contour for the original lyrics “I’m singing in the rain”. Middle:
pitch contour of the lyrics spoken in English. Bottom: pitch contour of the prose translation “wǒ zài yǔ zhōng gē
chàng” spoken in Mandarin.]


Figure 8: The pitch contour of the prose translation of the lyrics (bottom line, in Mandarin) does not match that of
the original music (upper line). The directions shown in the figure are estimated from the base frequency of speech
produced by text-to-speech tools. Such a mismatch in pitch contour makes the sung lyrics sound unnatural and hard to
understand.


[Figure 9 image: score of the lyric “cháng tíng wài, gǔ dào biān” (gloss: long pavilion outside, ancient lane side),
with two base-frequency plots, Sung and Spoken, each on an axis from 150 Hz to 450 Hz.]


Figure 9: An excerpt from a popular rewritten song in Mandarin, “Farewell (sòng bié)”. The original music is from an
American song, “Dreaming of Home and Mother”. We record the sung and spoken voice and plot the actual base frequency
of the sound. We can see how the tone shape and overall tonal contour align with the sung voice (that is, with the
music pitch).
Instruction

1) The sheet music for songs to be evaluated can be found in:
   /[Song_Name]
   - 1.pdf : sheet music for translation 1
   - 2.pdf : sheet music for translation 2
   Remark: We numerate the results from different translation models randomly. You should not assume that all
   “translation 1” are generated from the same model.

2) Criteria (1: bad 2: poor 3: fair 4: good 5: excellent)

   a. sense: How close the meaning of the translated line is to that the original line of lyric. The translation
      could be paraphrase.
      Remark: You should look into the context to identify the actual meaning of each line.
      [The lines with successive numeration are successive lines in lyrics]

   b. style: Whether the translated line looks like song-text style, as opposed to prose text.

   c. listenability: How well does it sound if the translated line is sung with given melody.
      [See corresponding sheet music]

   d. intelligibility: Can you understand the sung words? (Would you misheard the lyrics when it is sung)
      [See corresponding sheet music]

   Remark: We accept errors. Try not give bad if minor error occurred.
   For example, 1-bad would be “do not fit at all”; 5-excellent would be “fit almost perfectly”;
   3-fair would be “fit 50%-70%”


Figure 10: Instructions for human evaluation 